<!DOCTYPE html>
<html lang="en">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Optimizer deficiencies</title>
<link rel="stylesheet" type="text/css" href="https://gcc.gnu.org/gcc.css" />
</head>

<body>
<h1>Deficiencies of GCC's optimizer</h1>

<p>This page lists places where GCC's code generation is suboptimal.
Although the examples are small, the problems are usually quite deep.</p>

<p>Note: unless otherwise specified, all examples have been compiled
with the current CVS tree as of the date of the example, on x86, with
<code>-O2 -fomit-frame-pointer -fschedule-insns</code>.  (The x86 back
end disables <code>-fschedule-insns</code>, which is something that
should be revisited, because it almost always gives better code when I
turn it back on.)</p>


<!-- table of contents start -->
<h1 id="toc">Table of Contents</h1>
<ul>
<li><a href="#putting_constants_in_special_sections">Putting constants in special sections</a></li>
<li><a href="#un_cse">Un-CSE</a></li>
<li><a href="#clean_up_how_cse_works">Clean up how cse works</a></li>
<li><a href="#loop_optimization">Loop optimization</a></li>
<li><a href="#using_constraints_on_values">Using constraints on values</a></li>
<li><a href="#change_the_type_of_a_variable">Change the type of a variable</a></li>
<li><a href="#better_handling_for_very_sparse_switches">Better handling for very sparse switches</a></li>
<li><a href="#order_of_subexpressions">Order of subexpressions</a></li>
<li><a href="#better_builtin_string_functions">Better builtin string functions</a></li>
<li><a href="#data_prefetch">Data prefetch support</a></li>
<li><a href="#target-specific">Target specific optimizer deficiencies</a></li>
</ul>
<!-- table of contents end -->

<hr />
<h2 id="putting_constants_in_special_sections">Putting constants in special sections.</h2>

<p>If a function has been placed in a special
section via attributes, we may want to put its static data and string
constants in a special section too.  But which one?  (Being able to
specify a section for string constants would be useful for the Linux
kernel.)</p>

<hr />
<h2 id="un_cse">Un-cse.</h2>

<p>Perhaps we should have an un-cse step right after cse, which tries to
replace a reg with its value if the value can be substituted for the
reg everywhere, if that looks like an improvement.  Which is if the
reg is used only a few times.  Use rtx_cost to determine if the
change is really an improvement.</p>

<hr />
<h2 id="clean_up_how_cse_works">Clean up how cse works.</h2>

<p>The scheme is that each value has just one hash entry.  The
first_same_value and next_same_value chains are no longer needed.</p>

<p>For arithmetic, each hash table elt has the following slots:</p>
<ul>
  <li>Operation.  This is an rtx code.</li>
  <li>Mode.</li>
  <li>Operands 0, 1 and 2.  These point to other hash table elements.</li>
</ul>

<p>So, if we want to enter <code>(plus:SI (reg:SI 30) (const_int
104))</code>, we first enter <code>(const_int 104)</code> and find the
entry that <code>(reg:SI 30)</code> now points to.  Then we put these
elts into operands 0 and 1 of a new elt.  We put PLUS and
SI into the new elt.</p>

<p>Registers and mem refs would never be entered into the table as
such.  However, the values they contain would be entered.  There would
be a table indexed by regno which points at the hash entry for the
value in that reg.</p>

<p>The hash entry index now plays the role of a qty number.  We still
need qty_table and reg_eqv_table to record which regs share a
particular qty.</p>

<p>When a reg is used whose contents are unknown, we need to create a
hash table entry whose contents say "unknown", as a place holder for
whatever the reg contains.  If that reg is added to something, then
the hash entry for the sum will refer to the "unknown" entry.  Use
UNKNOWN for the rtx code in this entry.  This replaces make_new_qty.</p>

<p>For a constant, a unique hash entry would be made based on the
value of the constant.</p>

<p>What about MEM?  Each time a memory address is referenced, we need
a qty (a hash table elt) to represent what is in it.  (Just as for a
register.)  If this isn't known, create one, just as for a reg whose
contents are unknown.</p>

<p>We need a way to find all mem refs that still contain a certain
value.  Do this with a chain of hash elts (for memory addresses) that
point to locations that hold the value.  The hash elt for the value
itself should point to the start of the chain.  It would be good for
the hash elt for an address to point to the hash elt for the contents
of that address (but this ptr can be null if the contents have never
been entered).</p>

<p>With this data structure, nothing need ever be invalidated except
the lists of which regs or mems hold a particular value.  It is easy
to see if there is a reg or mem that is equiv to a particular value.
If the value is constant, it is always explicitly constant.</p>

<hr />
<h2 id="loop_optimization">Loop optimization</h2>

<p>Strength reduction and iteration variable elimination could be
smarter.  They should know how to decide which iteration variables are
not worth making explicit because they can be computed as part of an
address calculation.  Based on this information, they should decide
when it is desirable to eliminate one iteration variable and create
another in its place.</p>

<p>It should be possible to compute what the value of an iteration
variable will be at the end of the loop, and eliminate the variable
within the loop by computing that value at the loop end.</p>

<hr />
<h2 id="using_constraints_on_values">Using constraints on values</h2>

<p>Many operations could be simplified based on knowledge of the
minimum and maximum possible values of a register at any particular
time.  These limits could come from the data types in the tree, via
rtl generation, or they can be deduced from operations that are
performed.  For example, the result of an <code>and</code> operation
one of whose operands is 7 must be in the range 0 to 7.  Compare
instructions also tell something about the possible values of the
operand, in the code beyond the test.</p>

<p>Value constraints can be used to determine the results of a further
comparison.  They can also indicate that certain <code>and</code>
operations are redundant.  Constraints might permit a decrement and
branch instruction that checks zeroness to be used when the user has
specified to exit if negative.</p>

<hr />
<h2 id="change_the_type_of_a_variable">Change the type of a variable</h2>

<p>Sometimes a variable is declared as <code>int</code>, it is
assigned only once from a value of type <code>char</code>, and then it
is used only by comparison against constants.  On many machines,
better code would result if the variable had type <code>char</code>.
If the compiler could detect this case, it could change the
declaration of the variable and change all the places that use it.</p>

<hr />
<h2 id="better_handling_for_very_sparse_switches">Better handling for very sparse switches</h2>

<p>There may be cases where it would be better to compile a switch
statement to use a fixed hash table rather than the current
combination of jump tables and binary search.</p>

<hr />
<h2 id="order_of_subexpressions">Order of subexpressions</h2>

<p>It might be possible to make better code by paying attention to the
order in which to generate code for subexpressions of an expression.</p>

<hr />
<h2 id="distributive_law">Distributive law</h2>

<p>The C expression <code>*(X + 4 * (Y + C))</code> compiles better on
certain machines if rewritten as <code>*(X + 4*C + 4*Y)</code> because
of known addressing modes.  It may be tricky to determine when, and
for which machines, to use each alternative.</p>

<p>Some work has been done on this, in combine.c.</p>

<hr />
<h2 id="better_builtin_string_functions">Better builtin string functions</h2>

<p>Although GCC implements numerous optimizations of the standard C
library's string, math and I/O functions, there are still plenty more
transformations that could be implemented.</p>

<ul>
  <li>glibc has inline assembler versions of various string functions; GCC
  has some, but not necessarily the same ones on the same architectures.
  Additional <code>optab</code> entries, like the ones for <code>ffs</code>
  and <code>strlen</code>, could be provided for several more functions
  including <code>memset</code>, <code>strchr</code>, <code>strcpy</code>
  and <code>strrchr</code>.</li>

  <li>GCC could optimize <code>strcmp</code> (and <code>memcmp</code>)
  where one string is constant to compare successive bytes to known
  constant ones inline.  For lengths longer than about 4, the inline
  comparisons could test the prefix before calling the library function
  (or inline a suitable instruction) for the remainder of the strings.</li>

  <li>glibc optimizes <code>strncpy</code> from a string constant, where
  the maximum length is constant and greater than the length of the
  string constant including the terminating null character, by using
  an optimization not implemented in GCC.  Its makes use of its own
  <code>__mempcpy</code> function to copy and return the address after
  the data copied, to pass into <code>memset</code>.  This could be
  implemented if GCC provided a <code>__builtin_mempcpy</code> function.</li>

  <li>Similarly, GCC could transform a call to <code>strcat</code> into
  a call to <code>strchr</code> followed by a <code>strcpy</code> or
  <code>memcpy</code>.</li>

  <li>glibc optimizes <code>strcspn</code>, <code>strcspn</code> and
  <code>strpbrk</code> where the string of characters that need to be
  matched is constant and of length not greater than three.</li>
</ul>

<p>The GNU libc also currently contains macros to optimize calls to
some string functions with constant arguments and those that can be
implemented by processor specific instructions.  These transformations
are better performed in GCC, both to reduce the overhead of macro
expansion and to take advantage of the functions attributes, for
example to avoid a second call to a pure function altogether.  The
use of these macros tend to cause huge blowup in the size of preprocessed
source if nested; for example, each nested call to <code>strcpy</code>
expands the source 20-fold, with four nested calls having an expansion
ten megabytes in size.  GCC then consumes a huge amount of memory
compiling such expressions.  The remaining optimizations need to be
implemented in GCC and then disabled in glibc, with benefits to other
systems as well, and the potential to use information GCC has about
alignment.</p>

<p>All the string functions act as if they access individual
characters, so care may need to be taken that no
<code>-fstrict-aliasing</code> problems occur when internal uses of
other types are generated.  Also, the arguments to the string function
must be evaluated exactly once each (if they have any side effects),
even though the call to the string function might be optimized away.</p>

<p>Care must be taken that any optimizations in GCC are
standards-conforming in terms of not possibly accessing beyond the
arrays involved (possibly within a function call created in the
optimization); whereas the glibc macros know the glibc implementation
and how much memory it might access, GCC optimizations can't.</p>

<p>There are some further optimizations in glibc not covered here,
that either optimize calls to non-ISO C functions or call glibc
internal functions in the expansion of the macro.</p>

<p>Many of these optimizations should not be applied if
<code>-Os</code> is specified.</p>


<hr />
<h2 id="data_prefetch">Data prefetch support</h2>

<p>Loads from memory can take many cycles if the loaded data is not in
a cache line.  Cache misses can bring a CPU to a halt for several 100
cycles, so minimizing cache misses is one of the most important jobs
of a modern compiler.  Data prefetch instructions are one tool the
compiler can use for this.</p>

<p>Data prefetch instructions allow a compiler to minimize cache-miss
latency by loading data into a cache before it it accessed.  A separate
page describes
<a href="prefetch.html">data prefetch support and optimizations</a> that
are in development or already supported by GCC.</p>

<hr />
<h2 id="target-specific">Target specific optimizer deficiencies</h2>

<p>Almost all code transformations implemented in GCC are target
independent by design, but how well they work depends on how accurately
an architecture is modelled.  For example, combine may fail to combine
some instructions if a machine description does not provide a pattern,
or if the pattern exists but the predicates are too strong.
Another example is the target cost function which is used to determine
if a transformation is profitable.  And sometimes a transformation that
some specific target could benefit from is simply missing in the target
independent implementation.</p>

<p>Optimization deficiencies like these are listed per target in the
target specific pages below.  An issue listed for one target may still
also be an issue for other targets, but problems in the target specific
pages are problems with, and ideas for the machine descriptions of some
target.</p>

<ul>
<li><a href="x86.html">Optimization deficiencies for ia32 and x86-64 based architectures</a></li>
<li><a href="ia64.html">Projects to improve performance on IA-64</a></li>
</ul>

<p><strong>Note:</strong> If an issue listed in a target specific issues
page, but it clearly is a target indepentent issue, please move it to
a page discussing target indepentent issues</p>

</body>
</html>

